This class heavily relies on the interactive online textbook, CourseKata. To successfully run any code you learn in class, you need to load the coursekata package first. Remember to run this line of code at the top of any script you write in this class!!

library(coursekata) # run install.packages("coursekata") on a new machine

Otherwise, you will get an error message that says "could not find function ...".

Part I - Exploratory Data Analysis

Inspect Your Data

Load a Dataset

We will work with a sample of US Census data, stored in a data frame called mydata. First, use read.csv() to load the data from a CSV file.

mydata <- read.csv("link_to_my_dataframe.csv")

After loading the dataset, you can inspect it by using functions like head(), str(), glimpse(), or summary().

head(mydata) # gives you the first several rows
##   age     education gender native_country income
## 1  39     Bachelors   Male  United-States  26819
## 2  50     Bachelors   Male  United-States  28650
## 3  38   High School   Male  United-States  40974
## 4  53 No HS Diploma   Male  United-States  30426
## 5  28     Bachelors Female           Cuba  30786
## 6  37       Masters Female  United-States  42276
str(mydata) # gives you the structure of the data
## 'data.frame':    32561 obs. of  5 variables:
##  $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ education     : Factor w/ 5 levels "Bachelors","Doctorate",..: 1 1 3 5 1 4 5 3 4 1 ...
##  $ gender        : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ native_country: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
##  $ income        : chr  "26819" "28650" "40974" "30426" ...
summary(mydata) # gives you the summary statistics of the data
##       age                education        gender            native_country 
##  Min.   :17.00   Bachelors    : 5355   Female:10771   United-States:29170  
##  1st Qu.:28.00   Doctorate    :  413   Male  :21790   Mexico       :  643  
##  Median :37.00   High School  :20241                  ?            :  583  
##  Mean   :38.58   Masters      : 2299                  Philippines  :  198  
##  3rd Qu.:48.00   No HS Diploma: 4253                  Germany      :  137  
##  Max.   :90.00                                        Canada       :  121  
##                                                       (Other)      : 1709  
##     income         
##  Length:32561      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Recode a Variable

Using common sense, education, gender, and native_country should be categorical variables, while age and income should be quantitative. Yet from str() we see that income is mistakenly coded as character. Let’s convert it to a numeric variable. Recall that, to refer to a variable in R, you use the $ sign: data$variable.

mydata$income <- as.numeric(mydata$income)
class(mydata$income) # it is now numeric
## [1] "numeric"

In R, recode() is a quick way to recode a variable. For example, we can recode the “?” values in native_country to NA (R’s missing value).

mydata$native_country <- recode(mydata$native_country, "?"= NA_character_) # old_value = new_value

Create a New Variable

Let’s create a new variable to record the income level (categorical) based on the income variable. There are several ways to do this.

mydata$is_low_income <- mydata$income < 50000 # we create a new variable `is_low_income`, which is TRUE if `income` < 50000 and FALSE otherwise. 

mydata$income_level <- ifelse(mydata$income > 50000, "high", "low") # we create a new variable `income_level`, which is "high" if `income` > 50000 and "low" otherwise. 

mydata$income_group <- factor(ntile(mydata$income, 3), labels = c("low", "medium", "high")) # we create a new variable `income_group`, which divides people into 3 groups, based on the distribution of `income`: low, medium, and high. 

# now let's check the new variables we created.
head(mydata, 10)
##    age     education gender native_country income is_low_income income_level
## 1   39     Bachelors   Male  United-States  26819          TRUE          low
## 2   50     Bachelors   Male  United-States  28650          TRUE          low
## 3   38   High School   Male  United-States  40974          TRUE          low
## 4   53 No HS Diploma   Male  United-States  30426          TRUE          low
## 5   28     Bachelors Female           Cuba  30786          TRUE          low
## 6   37       Masters Female  United-States  42276          TRUE          low
## 7   49 No HS Diploma Female        Jamaica  32897          TRUE          low
## 8   52   High School   Male  United-States  89878         FALSE         high
## 9   31       Masters Female  United-States  92414         FALSE         high
## 10  42     Bachelors   Male  United-States  40153          TRUE          low
##    income_group
## 1           low
## 2           low
## 3          high
## 4        medium
## 5        medium
## 6          high
## 7        medium
## 8          high
## 9          high
## 10         high

Filter Data

Sometimes we need to filter the data to focus on a specific group. For example, we can filter the data to exclude observations with missing native_country.

mydata_noNA <- mydata %>% filter(native_country != "NA") # this works here because `filter()` also drops rows where the condition evaluates to NA.

mydata_noNA <- mydata %>% filter(!is.na(native_country)) # a clearer way: explicitly keep rows where `native_country` is not missing.

Similarly, we can filter the data to include only observations with income > $50000. The %>% sign is a pipe operator that passes the result of the previous command to the next command.

mydata_high_income <- mydata %>% filter(income > 50000)
head(mydata_high_income)
##   age   education gender native_country income is_low_income income_level
## 1  52 High School   Male  United-States  89878         FALSE         high
## 2  31     Masters Female  United-States  92414         FALSE         high
## 3  37 High School   Male  United-States 122256         FALSE         high
## 4  30   Bachelors   Male          India  99868         FALSE         high
## 5  40 High School   Male           <NA>  84153         FALSE         high
## 6  43     Masters Female  United-States  79537         FALSE         high
##   income_group
## 1         high
## 2         high
## 3         high
## 4         high
## 5         high
## 6         high

Sometimes we only need a subset of the data. We can use the sample() function (the coursekata/mosaic version works on data frames) to randomly select rows.

sample1 <- sample(mydata, 100, replace = FALSE) # randomly select 100 rows, without replacement
nrow(sample1)
## [1] 100

Examining the Distribution of a Variable

Before you examine the distribution of a variable, you need to know whether it is categorical or quantitative (numerical).

Distribution of a Categorical Variable

For a categorical variable (such as gender, color, political affiliation), we use tally() to get a frequency table, gf_bar() to get a bar plot, or gf_percents() to get a percentage plot.

tally(mydata$education) # gives you the frequency of each category.
## X
##     Bachelors     Doctorate   High School       Masters No HS Diploma 
##          5355           413         20241          2299          4253
tally(mydata$education, format = "percent") # gives you the percentage of each category.
## X
##     Bachelors     Doctorate   High School       Masters No HS Diploma 
##     16.446055      1.268389     62.163324      7.060594     13.061638
gf_bar(~education, data = mydata) # creates a bar plot: y-axis is the count.

gf_percents(~education, data = mydata) # creates a percent plot: y-axis is the percentage.

Distribution of a Quantitative Variable

For a numerical variable (such as age, GDP, income), we can use favstats() to get its summary statistics, or create a density plot, histogram, or boxplot to visualize its distribution.

What are the minimum, maximum, median, mean, IQR, and range of income?

favstats(mydata$income)
##    min    Q1 median    Q3    max     mean       sd     n missing
##  14761 27425  32462 43989 399502 45410.69 32767.73 32561       0

Now we can visualize the distribution of income.

gf_density(~income, data = mydata) # creates a density plot.

gf_histogram(~income, data = mydata) # creates a histogram.

gf_boxplot(~income, data = mydata) # creates a boxplot, which contains more details like median and range. 

gf_boxplot(income ~ ., data = mydata) # creates a boxplot (vertical). The dot is a placeholder. 

The distribution is skewed right. Lots of people earn less than $50,000, and a small proportion of people earn much more.
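A quick numeric check of the skew, using the favstats() output above: in a right-skewed distribution, the mean is pulled above the median.

```r
# Values copied from the favstats(mydata$income) output above.
income_mean   <- 45410.69
income_median <- 32462
income_mean > income_median  # TRUE: mean > median, consistent with right skew
```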

Examining the Distribution of Two Variables

A Quantitative Variable and A Categorical Variable

If we are interested in the distribution of a numerical variable (income) across different categories (gender):

favstats(income ~ gender, data = mydata) # gives you the summary statistics of `income` across different `gender`.
##   gender   min       Q1 median       Q3    max     mean       sd     n missing
## 1 Female 15157 26747.00  30877 36626.00 332992 37383.89 23859.92 10771       0
## 2   Male 14761 27863.25  33594 58985.75 399502 49378.41 35714.31 21790       0

Visualizing the distribution of income across different gender:

gf_histogram(~income, data = mydata) %>% gf_facet_grid(gender ~ .) # creates a histogram of `income` across different `gender`.

gf_histogram(~income, fill = ~gender, data = mydata) # creates a histogram of `income` across different `gender`, with different colors.

gf_boxplot(income ~ gender, data = mydata) # creates a boxplot of `income` across different `gender`.

gf_jitter(income ~ gender, data = mydata) # creates a jitter plot of `income` across different `gender`.

Note that in this class, we will not cover the case the other way around (examining a categorical variable across values of a quantitative variable).

Two Categorical Variables

If we are interested in the distribution of two categorical variables, like gender and education:

tally(gender ~ education, data = mydata) # gives you the frequency of each combination of `gender` and `education`.
##         education
## gender   Bachelors Doctorate High School Masters No HS Diploma
##   Female      1619        86        7117     628          1321
##   Male        3736       327       13124    1671          2932
gf_bar(~education, data = mydata) %>% gf_facet_grid(. ~ gender) # creates a bar plot of `education` across different `gender`.

gf_percents(~education, data = mydata) %>% gf_facet_grid(. ~ gender) # creates a percentage plot of `education` across different `gender`.

Two Quantitative Variables

If we are interested in the relationship between two numerical variables, like income and age:

gf_point(income ~ age, data = mydata) # creates a scatter plot of `income` and `age`.

gf_point(log(income) ~ age, data = mydata) # creates a scatter plot of logged `income` and `age`. Note that the y-axis is now in log scale.

gf_jitter(income ~ age, data = mydata) # creates a jitter plot of `income` and `age`.

Part II - Regression Models

Empty Model

The simplest model is the empty model, where we use the mean of a quantitative variable to model its distribution. In a word form: Income = Mean + Error. In a general linear model form: \(\text{Income}_i = b_0 + e_i\). DO NOT FORGET the error term \(e_i\)!!

# create a histogram of `income` and add a vertical line at the mean of `income`.
gf_histogram(~income, data = mydata) %>% 
  gf_vline(xintercept = ~mean, data = favstats(~income , data = mydata), color ="darkred") 

# create an empty model
empty_model <- lm(income ~ NULL, data = mydata)

# Visualize the empty model prediction
gf_histogram(~income, data = mydata) %>% gf_model(empty_model) # the same as the previous plot

# get the predictions from the empty model
mydata$empty_predicted <- predict(empty_model)

# check your model, it only has one intercept, which equals the mean of `income`. 
empty_model
## 
## Call:
## lm(formula = income ~ NULL, data = mydata)
## 
## Coefficients:
## (Intercept)  
##       45411
mean(mydata$income)
## [1] 45410.69

How far off are we?

Let’s check some statistics to evaluate the model.

# data_i = b_0 (mean) + error_i (residual)
# to get the residuals
mydata$empty_residuals <- resid(empty_model)
mean(mydata$empty_residuals) # should be 0
## [1] -4.086115e-11
# calculate the SAD (sum of absolute deviations/residuals/errors)
sum(abs(mydata$empty_residuals))
## [1] 742842665
# calculate the SSE (sum of squared errors)
sum(mydata$empty_residuals^2)
## [1] 3.496045e+13
# calculate the variance (mean squared error; MSE)
var(mydata$income) # var() function 
## [1] 1073723810
sum(mydata$empty_residuals^2) / (nrow(mydata)-1) # variance = SSE/(n-1)
## [1] 1073723810
# calculate the SD (standard deviation)
sd(mydata$income) # sd() function
## [1] 32767.73
sqrt(var(mydata$income)) # SD = sqrt(variance), the square root of the variance
## [1] 32767.73

Simple Linear Model

The simple linear model is a regression model with one predictor. We refer to X as the independent (explanatory) variable and Y as the dependent (response) variable.

Fit the model

First, let’s explain variation in income using age, a continuous variable. In a word form: Income = Age + Error. In a general linear model form: \(\text{Income}_i = b_0 + b_1*\text{Age}_i + e_i\) (plug in the estimated b0 and b1 we get).

age_model <- lm(income ~ age, data = mydata) # explain income using age
summary(age_model) # get the summary of the model
## 
## Call:
## lm(formula = income ~ age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50676 -17613 -10443   1180 356172 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27889.79     534.98   52.13   <2e-16 ***
## age           454.13      13.07   34.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32180 on 32559 degrees of freedom
## Multiple R-squared:  0.03574,    Adjusted R-squared:  0.03571 
## F-statistic:  1207 on 1 and 32559 DF,  p-value: < 2.2e-16
b0(age_model) # get the intercept
## [1] 27889.79
b1(age_model) # get the slope
## [1] 454.1253
# let's create a scatter plot of `income` and `age`, and add the regression line.
gf_point(income ~ age, data = mydata) %>% gf_model(age_model, color = "darkred")

What do b0 and b1 mean here? b0 is the intercept, which is the predicted value of income when age is 0 (hypothetically speaking). b1 is the slope, which is the change in income for each one-unit change in age. Here it means that, for each additional year of age, income increases by about $454.13.
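As a sanity check on this interpretation, we can plug an age into the fitted equation by hand, using the rounded coefficients printed in the summary above (predict() does the same thing for you):

```r
# Predicted income for a 40-year-old: income_hat = b0 + b1 * age
b0 <- 27889.79  # intercept from summary(age_model)
b1 <- 454.13    # slope from summary(age_model)
b0 + b1 * 40    # about $46,055
```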

Second, let’s explain variation in income using gender, a categorical variable. In a word form: Income = Gender + Error. In a general linear model form: \(\text{Income}_i = b_0 + b_1*\text{Gender}_i + e_i\).

gender_model <- lm(income ~ gender, data = mydata) # explain income using gender
summary(gender_model) # get the summary of the model
## 
## Call:
## lm(formula = income ~ gender, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34617 -19007 -10845   1491 350124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37383.9      311.0  120.20   <2e-16 ***
## genderMale   11994.5      380.2   31.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32280 on 32559 degrees of freedom
## Multiple R-squared:  0.02966,    Adjusted R-squared:  0.02963 
## F-statistic: 995.3 on 1 and 32559 DF,  p-value: < 2.2e-16
# For a categorical variable, we have a reference level. Here, the reference level is Female. We can tell it from the summary output -- the slope is called `genderMale`. 
b0(gender_model) # get the intercept
## [1] 37383.89
b1(gender_model) # get the slope
## [1] 11994.52
# let's create a histogram of `income` and `gender`, and add the regression line.
gf_histogram(~income, data = mydata) %>% gf_facet_grid(gender ~ .) %>%
  gf_model(gender_model) %>% # gender model: mean income for female and male, separately.
  gf_model(empty_model, color = "darkred") # empty model: mean income for all.

What do b0 and b1 mean here? b0 is the intercept, which is the mean income for the reference level (here, Female). b1 is the slope, which is the difference in income between the other level and the reference level (here, Male minus Female). It means that the average income of Male is about $11,994.52 higher than that of Female.
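As a check, the coefficients reproduce the group means we saw earlier in favstats(income ~ gender, data = mydata) (values copied from the summary output above):

```r
# Coefficients from summary(gender_model)
b0 <- 37383.89  # mean income for Female (the reference level)
b1 <- 11994.52  # Male minus Female difference
b0 + b1         # 49378.41, matching the Male group mean from favstats()
```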

ANOVA Table

We have “Total = Model + Error” in the ANOVA table. The Model is the variation in income explained by the model, and the Error is the variation in income not explained by the model.

supernova(age_model) # ANOVA table for the age model
##  Analysis of Variance Table (Type III SS)
##  Model: income ~ age
## 
##                                    SS    df           MS        F   PRE     p
##  ----- --------------- | ------------ ----- ------------ -------- ----- -----
##  Model (error reduced) | 1.249373e+12     1 1.249373e+12 1206.675 .0357 .0000
##  Error (from model)    | 3.371107e+13 32559 1.035384e+09                     
##  ----- --------------- | ------------ ----- ------------ -------- ----- -----
##  Total (empty model)   | 3.496045e+13 32560 1.073724e+09
  • The 1st column, SS, refers to the sum of squares. For example, the Sum of Squared Errors (SSE) is the SS value in the Error row, which is 3.371107e+13.
  • The 2nd column, df, refers to the degrees of freedom, which depends on the number of observations and variables.
  • The 3rd column, MS, refers to the mean square. MS is calculated by dividing SS by df: for example, MSE = SSE/df_Error.
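These relationships can be verified by hand from the printed table (values copied from the supernova(age_model) output above):

```r
# Values copied from the supernova(age_model) table.
ss_model <- 1.249373e+12
ss_error <- 3.371107e+13
ss_total <- 3.496045e+13
ss_model + ss_error  # approximately ss_total (Total = Model + Error, up to rounding)
ss_error / 32559     # MS = SS / df: about 1.035384e+09, matching the Error MS column
```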

Compare two models

Let’s compare the Age model and Gender model using PRE and F-ratio. Let’s also print out the ANOVA table for the gender model.

supernova(gender_model) # ANOVA table for the gender model
##  Analysis of Variance Table (Type III SS)
##  Model: income ~ gender
## 
##                                    SS    df           MS       F   PRE     p
##  ----- --------------- | ------------ ----- ------------ ------- ----- -----
##  Model (error reduced) | 1.037006e+12     1 1.037006e+12 995.297 .0297 .0000
##  Error (from model)    | 3.392344e+13 32559 1.041907e+09                    
##  ----- --------------- | ------------ ----- ------------ ------- ----- -----
##  Total (empty model)   | 3.496045e+13 32560 1.073724e+09
  • The 4th column, F, refers to the F-ratio, calculated as F = MS_Model / MS_Error. It represents how much variation is explained by the model per degree of freedom (the variance explained by the model is F-ratio times the leftover variance unexplained by the model). It can be higher than 1. We prefer a model with a higher F-ratio.
  • The 5th column, PRE, refers to the PRE (Proportional Reduction in Error), calculated as PRE = SS_Model / SS_Total. It measures how much of the variation in Y is explained by X (from 0 to 1); in other words, how much error is reduced. We prefer a model with a higher PRE.
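Using the SS values printed in the two tables above, we can compute each model’s PRE by hand and confirm the comparison:

```r
# SS values copied from the two supernova() tables above.
ss_total   <- 3.496045e+13             # same Total row for both models
pre_age    <- 1.249373e+12 / ss_total  # PRE = SS_Model / SS_Total, ~0.0357
pre_gender <- 1.037006e+12 / ss_total  # ~0.0297
pre_age > pre_gender  # TRUE: the age model explains a larger share of the variation
```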

Inference

A sampling distribution is a distribution of sample statistics, computed from many randomly generated samples of the same size. We can use shuffle() or resample() for the random data generating process (DGP):

# shuffle income: this simulates the DGP of the empty model (no real relationship with age)
mydata$shuffled_income <- shuffle(mydata$income) 
summary(lm(shuffled_income ~ age, data = mydata)) # the new age model
## 
## Call:
## lm(formula = shuffled_income ~ age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30499 -17977 -12941  -1437 353824 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 44818.12     544.79  82.267   <2e-16 ***
## age            15.36      13.31   1.154    0.249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32770 on 32559 degrees of freedom
## Multiple R-squared:  4.088e-05,  Adjusted R-squared:  1.016e-05 
## F-statistic: 1.331 on 1 and 32559 DF,  p-value: 0.2486
# save the b1 estimate from our actual sample
sample_b1 <- b1(income ~ age, data = mydata)
sdob1 <- do(100) * b1(income ~ shuffle(age), data = mydata) # 100 b1s from shuffled samples
# alternatively: sdob1 <- do(100) * b1(income ~ age, data = resample(mydata))

# how often do we get a b1 value as or more extreme as the sample b1?
tally(sdob1$b1 > sample_b1 | sdob1$b1 < -sample_b1)
## X
##  TRUE FALSE 
##     0   100
# visualize it
gf_histogram(~b1, data = sdob1)

Now, let’s look at the summary of the age model again. What do these values mean?

summary(age_model)
## 
## Call:
## lm(formula = income ~ age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50676 -17613 -10443   1180 356172 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27889.79     534.98   52.13   <2e-16 ***
## age           454.13      13.07   34.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32180 on 32559 degrees of freedom
## Multiple R-squared:  0.03574,    Adjusted R-squared:  0.03571 
## F-statistic:  1207 on 1 and 32559 DF,  p-value: < 2.2e-16
  • Estimate: our b0 and b1 values, our estimated parameters.
  • Std. Error stands for the standard error: the standard deviation of the sampling distribution of the parameter estimate. (For a sample mean, SE = SD/sqrt(n).)
  • t value is the standardized distance from 0 in the student t distribution. We have t-value = Estimate / SE. The larger the t-value, the more extreme the estimated coefficient is in the sampling distribution of an empty model.
  • Pr(>|t|) refers to the p-value, which is the probability of getting a parameter estimate as extreme or more extreme than the sample estimate given the assumption that the empty model is true (null hypothesis). We prefer a model with a lower p-value.
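For example, the t value for age can be recovered from the Estimate and Std. Error columns above (small differences are due to rounding in the printed values):

```r
# t value = Estimate / SE, using the rounded values from summary(age_model)
estimate <- 454.13
se       <- 13.07
estimate / se  # ~34.75, close to the printed t value of 34.74
```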

Confidence Interval (CI): We know that in the normal distribution, 95% of the distribution is within 1.96 standard deviations. Therefore, for example, the lower bound of b1 is b1-1.96*SE, and the upper bound of b1 is b1+1.96*SE. To get the CI of b0 and b1 in R, we can run:

confint(age_model) # get the 95% CI of the age model
##                  2.5 %     97.5 %
## (Intercept) 26841.2112 28938.3641
## age           428.5014   479.7491
c(454.13 - 1.96 * 13.07, 454.13 + 1.96 * 13.07) # calculate the CI manually
## [1] 428.5128 479.7472
confint(age_model, level = 0.99) # get the 99% CI of the age model
##                  0.5 %     99.5 %
## (Intercept) 26511.6951 29267.8803
## age           420.4491   487.8015

From the first output, we are 95% confident that the true value of b1 is between 428.5 and 479.7. Since this interval does not include 0, we can reject the null hypothesis at the 95% confidence level.

Multivariate Model

The multivariate model is a regression model with multiple predictors. For example, it makes sense to explain variation in income using both age and gender. In a word form: Income = Age + Gender + Error. In a general linear model form: \(\text{Income}_i = b_0 + b_1*\text{Age}_i + b_2*\text{Gender}_i + e_i\).

# explain income using age and gender
age_gender_model <- lm(income ~ age + gender, data = mydata) 
summary(age_gender_model) 
## 
## Call:
## lm(formula = income ~ age + gender, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47908 -18283  -9735   3197 352409 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21877.91     567.30   38.56   <2e-16 ***
## age           420.69      12.96   32.47   <2e-16 ***
## genderMale  10911.11     375.68   29.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31770 on 32558 degrees of freedom
## Multiple R-squared:  0.06009,    Adjusted R-squared:  0.06003 
## F-statistic:  1041 on 2 and 32558 DF,  p-value: < 2.2e-16

So, in a GLM form, the model is: \(\text{Income}_i = 21877.9 + 420.7*\text{Age}_i + 10911.1*\text{Gender}_i + e_i\). How should we interpret these coefficients? \(b_1\) is the change in income for each one-unit change in age while holding gender fixed, and \(b_2\) is the difference in income between Male and Female (the reference level) while holding age fixed.
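To see how this equation is used, here is a hand-computed prediction with the rounded coefficients (genderMale is coded 1 for Male and 0 for Female):

```r
# Coefficients copied from summary(age_gender_model)
b0 <- 21877.9; b1_age <- 420.7; b2_male <- 10911.1
b0 + b1_age * 30 + b2_male * 1  # predicted income for a 30-year-old Male, ~ $45,410
b0 + b1_age * 30 + b2_male * 0  # predicted income for a 30-year-old Female, ~ $34,499
```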

# get the ANOVA table
supernova(age_gender_model)
##  Analysis of Variance Table (Type III SS)
##  Model: income ~ age + gender
## 
##                                     SS    df           MS        F   PRE     p
##  ------ --------------- | ------------ ----- ------------ -------- ----- -----
##   Model (error reduced) | 2.100731e+12     2 1.050366e+12 1040.721 .0601 .0000
##     age                 | 1.063725e+12     1 1.063725e+12 1053.958 .0314 .0000
##  gender                 | 8.513584e+11     1 8.513584e+11  843.541 .0253 .0000
##   Error (from model)    | 3.285972e+13 32558 1.009267e+09                     
##  ------ --------------- | ------------ ----- ------------ -------- ----- -----
##   Total (empty model)   | 3.496045e+13 32560 1.073724e+09

Compared to the age model, the age+gender model has a lower F-ratio but a higher PRE.

Interaction Term

We can add an interaction term between age and gender to the model, which allows the effect of age on income to differ by gender.

age_gender_interaction <- lm(income ~ age*gender, data = mydata) 
summary(age_gender_interaction) 
## 
## Call:
## lm(formula = income ~ age * gender, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54344 -17044  -9119   1926 353101 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    30518.03     859.04  35.526   <2e-16 ***
## age              186.28      21.79   8.551   <2e-16 ***
## genderMale     -2748.16    1088.45  -2.525   0.0116 *  
## age:genderMale   361.70      27.06  13.366   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31680 on 32557 degrees of freedom
## Multiple R-squared:  0.06522,    Adjusted R-squared:  0.06513 
## F-statistic: 757.2 on 3 and 32557 DF,  p-value: < 2.2e-16
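What does the interaction coefficient mean? The slope of age now differs by gender. Using the coefficients printed above, each group’s age slope is:

```r
# Coefficients copied from summary(age_gender_interaction)
slope_female <- 186.28           # `age` coefficient: slope for the reference level, Female
slope_male   <- 186.28 + 361.70  # reference slope plus the `age:genderMale` interaction
slope_male   # 547.98: income rises ~$548 per year of age for Male, vs ~$186 for Female
```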